library(MASS)
library(dplyr)
The following variable-screening method is called stepwise regression. This type of model selection adds and drops variables one at a time until no single addition or removal improves the model. The criterion it uses to compare candidate models is the AIC (Akaike information criterion). The AIC decreases as the fit improves and increases with every additional variable, so a lower AIC indicates a better trade-off between goodness of fit and model size.
We wanted to predict a team's post-season conference ranking from pre-season statistics, acquired from a large dataset being used for a senior thesis.
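As a rough illustration of the AIC comparison that step() performs, the sketch below runs the same kind of search on R's built-in mtcars data. This is a stand-in, not the football dataset, and the choice of predictors is arbitrary. For a model with k estimated parameters and maximized likelihood L, AIC = 2k - 2 ln(L), so lower values are better.

```r
# Stand-in data: mtcars ships with base R; it is NOT the football dataset.
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)

# At each step, step() makes the single add/drop that lowers the AIC the most
# and stops when no single change lowers it. trace = 0 hides the printout.
selected <- step(full, direction = "both", trace = 0)

# extractAIC() returns c(equivalent df, AIC) on the scale that step() uses.
aic_full     <- extractAIC(full)[2]
aic_selected <- extractAIC(selected)[2]

aic_full
aic_selected
formula(selected)
```

The selected model can never have a higher AIC than the starting model, because step() only accepts changes that reduce it.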
X_CFBRversionSEC <- X_CFBRversion %>%
filter(SEC == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionSEC), direction = c("both"))
X_CFBRversionACC <- X_CFBRversion %>%
filter(ACC == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionACC), direction = c("both"))
X_CFBRversionBigTen <- X_CFBRversion %>%
filter(BigTen == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionBigTen), direction = c("both"))
X_CFBRversionPacTen <- X_CFBRversion %>%
filter(PacTen == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionPacTen), direction = c("both"))
X_CFBRversionBigTwelve <- X_CFBRversion %>%
filter(BigTwelve == 1)
#Stepwise Regression
sw<-step(lm(EPSNrank~FrNbrRecruits+Fr5star+Fr4star+Fr3star+Fravg+Sonbrrecruits+So5star+So4star+So3star+Soavg+Jrnbrrecruits+Jr5star+Jr4star+Jr3star+Jravg+Srnbrrecruits+Sr5star+Sr4star+Sr3star+Sravg+Rssrnbrrecruits+Rssr5star+Rssr4star+Rssr3star+Rssravg+z_lysagarin+z_tyasagarin+retoff+retdef+qbret+bowl+bowlwin+coachexp_school+coachexp_total+BigTen+SEC+BigTwelve+ACC+PacTen+Bigeast, data = X_CFBRversionBigTwelve), direction = c("both"))
Through this process, we found that different variables were selected in different conferences. In other words, what it takes to be successful relative to the other teams in a conference appears to depend on the conference.
SEC: “Fr5star”, “Fr4star”, “So4star”, “Soavg”, “Jrnbrrecruits”, “Jr5star”, “Jr4star”, “Rssr5star”, “Rssr4star”, “z_lysagarin”, “coachexp_school”, “coachexp_total”
ACC: “FrNbrRecruits”, “Fr4star”, “Sonbrrecruits”, “Jrnbrrecruits”, “Jr5star”, “Jr4star”, “Jr3star”, “Jravg”, “Rssr4star”, “z_lysagarin”, “retoff”, “bowl”, “coachexp_total”
BigTen: “FrNbrRecruits”, “Fr3star”, “So4star”, “Jrnbrrecruits”, “Jr5star”, “Jr4star”, “Jr3star”, “Jravg”, “Srnbrrecruits”, “Sr3star”, “z_lysagarin”, “coachexp_total”
PacTen: “FrNbrRecruits”, “Fr4star”, “Sonbrrecruits”, “So4star”, “Soavg”, “Jrnbrrecruits”, “Jravg”, “Sr3star”, “Rssrnbrrecruits”, “Rssr5star”, “Rssr4star”, “Rssr3star”, “Rssravg”, “z_lysagarin”, “z_tyasagarin”, “retdef”, “qbret”, “coachexp_school”, “coachexp_total”, “Sravg”
BigTwelve: “FrNbrRecruits”, “Fr4star”, “Fr3star”, “Fravg”, “Sonbrrecruits”, “Soavg”, “Jrnbrrecruits”, “Jr5star”, “Srnbrrecruits”, “Rssrnbrrecruits”, “Rssravg”, “z_lysagarin”, “coachexp_school”, “coachexp_total”
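Lists like these do not have to be copied by hand from the step() printout. The sketch below shows one possible way to run the search once per group and read the surviving term names off each fitted model. It uses mtcars split by transmission type as a stand-in for the conference-filtered datasets; the predictor list and the object names are illustrative assumptions, not part of the original analysis.

```r
# Stand-in data: mtcars grouped by transmission type (am) plays the role of
# the per-conference datasets. The predictor list is illustrative only.
predictors   <- c("wt", "hp", "disp", "qsec", "drat")
full_formula <- reformulate(predictors, response = "mpg")

groups   <- split(mtcars, mtcars$am)
selected <- list()
for (g in names(groups)) {
  d   <- groups[[g]]
  fit <- step(lm(full_formula, data = d), direction = "both", trace = 0)
  # term.labels holds the names of the predictors that survived selection
  selected[[g]] <- attr(terms(fit), "term.labels")
}
selected
```

Keeping each result under its own name also avoids overwriting a single object such as sw on every run.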
Next, we subset the data for each conference so that it contained only the variables selected in the variable-screening process.
This made it easier to write a function that takes a conference as a string (e.g. “SEC”) and a year, and outputs that conference’s rankings for that year.
The function restricts the data to the conference you specify and then splits it into two datasets: one that contains the data from every year except the year you specified (the training data), and one that contains only the data for the year you specified (the test data). The multiple linear regression model is fit on the training data, and that model is then used to predict the test data.
SECsubset <- as.data.frame(X_CFBRversion[,c(1,2,3,7,8,14,16,18,19,20,31,32,38,46,47,50)])
SECsubset<- SECsubset %>%
filter(SEC == 1)
ACCsubset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,8,12,18,19,20,21,22,32,38,41,44,47,52)])
ACCsubset<- ACCsubset %>%
filter(ACC == 1)
PAC10subset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,8,12,14,16,18,22,27,30,31,32,33,34,38,40,42,43,46,47,28,53)])
PAC10subset<- PAC10subset %>%
filter(PacTen == 1)
Big10subset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,9,14,18,19,20,21,22,24,27,38,47,49)])
Big10subset<- Big10subset %>%
filter(BigTen == 1)
Big12subset <- as.data.frame(X_CFBRversion[,c(1,2,3,6,8,9,10,12,16,18,19,24,30,34,38,46,47,51)])
Big12subset<- Big12subset %>%
filter(BigTwelve == 1)
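The subsets above are built from numeric column positions, which silently break if the column order of the source data ever changes. A minimal sketch of the same idea using column names is shown below. The helper conference_subset and the toy data frame are hypothetical and only illustrate the approach; the toy values are made up.

```r
# Toy stand-in for X_CFBRversion; the values are made up for illustration.
toy <- data.frame(
  Team        = c("A", "B", "C", "D"),
  Year        = c(2017, 2018, 2017, 2018),
  EPSNrank    = c(1, 2, 1, 2),
  Fr5star     = c(3, 1, 0, 2),
  z_lysagarin = c(1.2, 0.4, -0.3, 0.8),
  SEC         = c(1, 1, 0, 0),
  ACC         = c(0, 0, 1, 1)
)

# Hypothetical helper: keep the id columns plus the selected predictors,
# restricted to rows where the conference indicator equals 1.
conference_subset <- function(data, conf, vars) {
  keep <- c("Team", "Year", "EPSNrank", vars)
  missing_cols <- setdiff(c(keep, conf), names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  data[data[[conf]] == 1, keep, drop = FALSE]
}

sec_vars   <- c("Fr5star", "z_lysagarin")
SECsubset2 <- conference_subset(toy, "SEC", sec_vars)
SECsubset2
```

Selecting by name also makes a misspelled or missing variable fail loudly instead of pulling in the wrong column.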
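# Predict one conference's rankings for one season.
# x: conference name ("SEC", "ACC", "BigTen", "PacTen" or "BigTwelve"); y: season (year) to hold out as test data.
predRank<- function(x,y)
{
if(!(x %in% c("SEC","ACC","BigTen","PacTen","BigTwelve"))) stop("Unknown conference: ", x)
if(x=="SEC") {
dat<-subset(SECsubset, SECsubset$Year != y)
newdat<-subset(SECsubset, SECsubset$Year == y)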
colnames(dat) <- c("Team","Year","EPSNrank","Fr5star","Fr4star","So4star","Soavg","Jrnbrrecruits","Jr5star","Jr4star","Rssr5star","Rssr4star","z_lysagarin","coachexp_school","coachexp_total")
sw<- lm(EPSNrank ~ Fr5star + Fr4star + So4star + Soavg + Jrnbrrecruits +
Jr5star + Jr4star + Rssr5star + Rssr4star + z_lysagarin +
coachexp_school + coachexp_total, data = dat)
preds<-predict(sw, newdata = newdat)
predset<-t(rbind(newdat$Team,newdat$EPSNrank,preds))
preddf<-as.data.frame(predset)
}
if(x=="ACC") {
dat<-subset(ACCsubset, ACCsubset$Year != y)
newdat<-subset(ACCsubset, ACCsubset$Year == y)
colnames(dat) <- c("Team","Year","EPSNrank", "FrNbrRecruits" , "Fr4star" , "Sonbrrecruits", "Jrnbrrecruits", "Jr5star", "Jr4star", "Jr3star", "Jravg", "Rssr4star", "z_lysagarin","retoff", "bowl" ,"coachexp_total")
fw<- lm(EPSNrank ~ FrNbrRecruits + Fr4star + Sonbrrecruits + Jrnbrrecruits +
Jr5star + Jr4star + Jr3star + Jravg + Rssr4star + z_lysagarin +
retoff + bowl + coachexp_total, data = dat)
preds<-predict(fw, newdata = newdat)
predset<-t(rbind(newdat$Team, newdat$EPSNrank,preds))
preddf<-as.data.frame(predset)
}
if(x=="BigTen"){
dat<-subset(Big10subset, Big10subset$Year != y)
newdat<-subset(Big10subset, Big10subset$Year == y)
colnames(dat) <- c("Team","Year","EPSNrank","FrNbrRecruits","Fr3star","So4star","Jrnbrrecruits","Jr5star","Jr4star","Jr3star","Jravg","Srnbrrecruits","Sr3star","z_lysagarin","coachexp_total")
bw <- lm(EPSNrank ~ FrNbrRecruits + Fr3star + So4star + Jrnbrrecruits +
Jr5star + Jr4star + Jr3star + Jravg + Srnbrrecruits + Sr3star +
z_lysagarin + coachexp_total, data = dat)
preds<-predict(bw, newdata = newdat)
predset<-t(rbind(newdat$Team,newdat$EPSNrank,preds))
preddf<-as.data.frame(predset)
}
if(x=="PacTen"){
dat<-subset(PAC10subset, PAC10subset$Year != y)
newdat<-subset(PAC10subset, PAC10subset$Year == y)
colnames(dat) <- c("Team","Year","EPSNrank", "FrNbrRecruits" , "Fr4star" , "Sonbrrecruits", "So4star", "Soavg", "Jrnbrrecruits" , "Jravg" , "Sr3star" , "Rssrnbrrecruits" , "Rssr5star" , "Rssr4star" , "Rssr3star" , "Rssravg" , "z_lysagarin" , "z_tyasagarin" , "retdef" , "qbret" , "coachexp_school" , "coachexp_total" ,
"Sravg")
bw <- lm(EPSNrank ~ FrNbrRecruits + Fr4star + Sonbrrecruits + So4star +
Soavg + Jrnbrrecruits + Jravg + Sr3star + Rssrnbrrecruits +
Rssr5star + Rssr4star + Rssr3star + Rssravg + z_lysagarin +
z_tyasagarin + retdef + qbret + coachexp_school + coachexp_total +
Sravg, data = dat)
preds<-predict(bw, newdata = newdat)
predset<-t(rbind(newdat$Team,newdat$EPSNrank,preds))
preddf<-as.data.frame(predset)
}
if(x =="BigTwelve"){
dat<-subset(Big12subset, Big12subset$Year != y)
newdat<-subset(Big12subset, Big12subset$Year == y)
colnames(dat) <- c("Team","Year","EPSNrank","FrNbrRecruits", "Fr4star", "Fr3star", "Fravg", "Sonbrrecruits","Soavg" , "Jrnbrrecruits" , "Jr5star" , "Srnbrrecruits" , "Rssrnbrrecruits" , "Rssravg" , "z_lysagarin" , "coachexp_school" , "coachexp_total")
fw<-lm(EPSNrank ~ FrNbrRecruits + Fr4star + Fr3star + Fravg + Sonbrrecruits +
Soavg + Jrnbrrecruits + Jr5star + Srnbrrecruits + Rssrnbrrecruits +
Rssravg + z_lysagarin + coachexp_school + coachexp_total, data = dat)
preds<-predict(fw, newdata = newdat)
predset<-t(rbind(newdat$Team, newdat$EPSNrank,preds))
preddf<-as.data.frame(predset)
}
preddf$V2<-as.numeric(as.character(preddf$V2))
preddf$preds<-as.numeric(as.character(preddf$preds))
return(preddf)
}
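The five branches of predRank differ only in the dataset and the model formula. The leave-one-year-out logic they share can be sketched generically as below. The helper predict_holdout_year and the toy data are hypothetical, and unlike predRank this version returns named columns (Team, EPSNrank, preds) rather than V1, V2, preds.

```r
# Hypothetical helper: fit `formula` on every season except `year`,
# then predict the held-out season.
predict_holdout_year <- function(data, formula, year) {
  train <- data[data$Year != year, , drop = FALSE]
  test  <- data[data$Year == year, , drop = FALSE]
  if (nrow(test) == 0) stop("No rows for year ", year)
  fit <- lm(formula, data = train)
  data.frame(
    Team     = test$Team,
    EPSNrank = test$EPSNrank,
    preds    = unname(predict(fit, newdata = test))
  )
}

# Toy data, made up for illustration only.
set.seed(1)
toy <- data.frame(
  Team        = rep(c("A", "B", "C", "D"), times = 3),
  Year        = rep(2016:2018, each = 4),
  z_lysagarin = rnorm(12)
)
# Within each year, the team with the highest rating gets rank 1.
toy$EPSNrank <- ave(-toy$z_lysagarin, toy$Year, FUN = rank)

# One formula per conference would replace the five near-identical branches.
formulas <- list(Toy = EPSNrank ~ z_lysagarin)
out <- predict_holdout_year(toy, formulas[["Toy"]], 2018)
out[order(out$preds), ]
```

With this layout, adding a conference means adding one entry to the list of formulas rather than another copy of the fitting code.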
Next, we predicted the conference rankings for the 2018 season using the function we created above. The results are below. In each table, V1 is the team, V2 is its actual EPSNrank, and preds is the predicted value. One limitation is that the function returns the predicted rankings as real numbers rather than integer ranks. Therefore, we simply ordered the predictions from least to greatest, as you can see below.
SEC<-predRank("SEC",2018)
ACC<-predRank("ACC",2018)
BigTen<-predRank("BigTen",2018)
PacTen<-predRank("PacTen",2018)
BigTwelve<-predRank("BigTwelve",2018)
SEC<-SEC[order(SEC$preds),]
SEC
                   V1 V2      preds
149           Georgia  2  0.7483953
151               LSU  4  3.8811658
145           Alabama  1  5.6032631
150          Kentucky  5  5.8993797
157         Texas A&M  6  7.1220021
154          Ole Miss 13  7.3828862
155    South Carolina  9  7.4111542
147            Auburn 10  7.8791257
153          Missouri  7  8.3805134
152 Mississippi State  8  9.0233377
156         Tennessee 12  9.3217204
146          Arkansas 14 10.2480061
158        Vanderbilt 11 10.4361920
148           Florida  3 11.4028449
ACC<-ACC[order(ACC$preds),]
ACC
                V1 V2     preds
149       Miami-FL  8  1.834273
146  Florida State 12  1.880734
155  Virginia Tech  9  3.276114
150       NC State  4  4.534099
148     Louisville 14  4.671640
156    Wake Forest 11  4.825418
154       Virginia  7  4.908386
144        Clemson  1  4.975559
143 Boston College  6  5.566071
145           Duke 10  8.241210
147   Georgia Tech  5  9.086959
153       Syracuse  2  9.579264
151 North Carolina 13 10.772782
152     Pittsburgh  3 12.379287
BigTen<-BigTen[order(BigTen$preds),]
BigTen
                V1 V2      preds
147     Penn State  4 -0.4226229
142 Michigan State  7  2.9731629
146     Ohio State  1  4.0818182
141       Michigan  2  5.3926006
139           Iowa  5  5.5232166
144       Nebraska 11  7.4059580
148         Purdue  8  8.8154271
145   Northwestern  3  8.9978007
143      Minnesota  9  9.5228485
137       Illinois 13  9.8637852
138        Indiana 12 10.1719705
140       Maryland 10 10.5101409
149        Rutgers 14 12.1588026
150      Wisconsin  6         NA
PacTen<-PacTen[order(PacTen$preds),]
PacTen
                V1 V2    preds
135     Washington  2 4.498936
126  Arizona State  6 5.861611
125        Arizona  8 6.033062
131   Southern Cal  9 6.131612
127     California  7 6.304335
132       Stanford  4 7.533845
129         Oregon  5 7.898436
128       Colorado 11 8.350038
130   Oregon State 12 8.539411
136 Washington St.  1 8.610382
133           UCLA 10 8.979491
134           Utah  3 9.280296
BigTwelve<-BigTwelve[order(BigTwelve$preds),]
BigTwelve
                V1 V2     preds
123       Oklahoma  1  3.572277
122   Kansas State  9  3.915932
124 Oklahoma State  7  4.489035
126          Texas  2  4.738465
120     Iowa State  4  5.826364
128  West Virginia  3  6.952349
127     Texas Tech  8  7.402782
125            TCU  6  8.590376
119         Baylor  5  9.810153
121         Kansas 10 10.416551
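Because the model returns real-valued predictions, one possible way to turn them into integer positions is base R's rank(). The sketch below uses three predicted values and the missing prediction copied from the Big Ten output above; how missing predictions and ties should be treated is an assumption made here for illustration.

```r
# Predicted values copied from the Big Ten output above, including the row
# whose prediction is NA.
preds <- c(`Penn State` = -0.4226229, `Michigan State` = 2.9731629,
           `Ohio State` = 4.0818182, Wisconsin = NA)

# rank() turns real-valued predictions into positions 1, 2, 3, ...
# na.last = "keep" leaves a team without a prediction unranked (NA);
# ties.method = "min" gives tied teams the same (best) position.
pred_rank <- rank(preds, na.last = "keep", ties.method = "min")
pred_rank
```

Ranking within the conference also removes out-of-range values such as the negative prediction for Penn State.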
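Our predictions weren’t very accurate. However, the team predicted to finish first usually finished in the actual top five: this held in four of the five conferences above, the exception being the ACC, where Miami-FL finished 8th.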
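To put a number on accuracy, one option is to compare the predicted order with the actual order. The sketch below does this for the SEC, using the actual ranks and predicted values copied from the SEC table above. The choice of Spearman correlation and mean absolute rank error is an assumption for illustration, not a metric the original analysis reports.

```r
# Actual rank (V2) and predicted value (preds) copied from the SEC table above.
actual <- c(2, 4, 1, 5, 6, 13, 9, 10, 7, 8, 12, 14, 11, 3)
preds  <- c(0.7483953, 3.8811658, 5.6032631, 5.8993797, 7.1220021,
            7.3828862, 7.4111542, 7.8791257, 8.3805134, 9.0233377,
            9.3217204, 10.2480061, 10.4361920, 11.4028449)

# Predicted position within the conference (1 = predicted best).
pred_rank <- rank(preds, ties.method = "min")

# Spearman correlation: 1 means the predicted order matches the actual order.
rho <- cor(actual, preds, method = "spearman")

# Mean absolute difference between predicted and actual position.
mae <- mean(abs(pred_rank - actual))

rho
mae
```

The same two numbers computed for every conference and both models would allow a like-for-like comparison of the regression and tree predictions.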
We also used another method of prediction, decision trees. We wrote a very similar function to predict a given season's conference rankings for each conference. The function takes a conference as a string (e.g. “SEC”) and a year and predicts that conference’s rankings for that year, just like the one above but based on a decision tree model.
A new decision tree is created for each conference. The process for this function is essentially the same as the one above, except that decision trees are used to predict.
One thing to note about the tree is that it assigns the same predicted value to every team that falls in the same leaf, so the predictions take only a few distinct values. Therefore a lot of “ties” show up in the predictions, as you will see below.
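The ties can be reproduced on any dataset. The sketch below fits a regression tree with the rpart package on mtcars and counts the distinct predictions. Both the package and the data are assumptions made for illustration, since the code of treePredRank is not shown here and may use a different tree implementation.

```r
library(rpart)  # assumption: the tree package used by treePredRank is not shown

# Stand-in data: mtcars, not the football dataset.
tree  <- rpart(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
preds <- predict(tree, newdata = mtcars)

# Each leaf predicts the mean response of its training rows, so the number of
# distinct predictions cannot exceed the number of leaves.
n_leaves <- sum(tree$frame$var == "<leaf>")
n_leaves
table(round(preds, 2))
```

Every row that lands in the same leaf receives the same prediction, which is why ordering tree predictions produces blocks of tied teams.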


SECTree<-treePredRank("SEC",2018)
ACCTree<-treePredRank("ACC",2018)
BigTenTree<-treePredRank("BigTen",2018)

PacTenTree<-treePredRank("PacTen",2018)

BigTwelveTree<-treePredRank("BigTwelve",2018)

These predictions are also not very accurate. However, the ties are interesting because they show how teams that were predicted to perform the same ended up differing in the post-season.
Overall, pre-season statistics don't seem to be a very good predictor of post-season conference rankings, whether you use multiple linear regression with variable screening or decision trees.
An important result was that different conferences yield different predictors of success.
